Survival regression takes the linear combination and uses it to predict survival. But survival
data presents some special challenges:
Censoring: Censoring happens when the event doesn’t occur during the observation time of the
study (which, in human studies, means during follow-up). Before considering using survival
regression on your data, you need to evaluate the impact censoring may have on the results. You
can do this using life tables, the Kaplan-Meier method, and the log-rank test, as described in
Chapters 21 and 22.
Survival curve shapes: Some business disciplines develop models for estimating time to failure
of mechanical or electronic devices. They estimate the times to certain kinds of events, like a
computer’s motherboard wearing out or the transmission of a car going kaput, and find that they
follow remarkably predictable shapes or distributions (the most common being the Weibull
distribution, covered in Chapter 24). Because of this, these disciplines often use a parametric
form of survival regression, which assumes that you can represent the survival curves by algebraic
formulas. Unfortunately for biostatisticians, biological data tends to produce survival curves
whose shapes can’t be represented well by these parametric distributions.
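To make the censoring challenge concrete, here is a minimal sketch of the Kaplan-Meier calculation in plain Python. The function name kaplan_meier and the six-subject data set are hypothetical, made up purely for illustration. The point to notice is that a censored subject still counts in the at-risk group up to the time they leave the study, but never counts as an event.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : follow-up time for each subject
    events : 1 if the event was observed, 0 if the subject was censored
    """
    # Count observed events at each distinct time; censored subjects are skipped here.
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    curve = []
    surv = 1.0
    for t in sorted(deaths):
        # Risk set: everyone still under observation at time t,
        # including subjects who will be censored at t or later.
        at_risk = sum(1 for u in times if u >= t)
        surv *= 1.0 - deaths[t] / at_risk
        curve.append((t, surv))
    return curve

# Hypothetical data: six subjects, follow-up time in months.
times = [2, 3, 3, 5, 8, 8]
events = [1, 1, 0, 1, 0, 1]  # 0 marks a censored subject

for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

If you change an event flag from 1 to 0 and rerun the sketch, you can see how a censored observation changes the estimated curve, which is the kind of check the paragraph on censoring recommends before you move on to regression.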
As described earlier, nonparametric survival analyses using life tables, Kaplan-Meier plots, and
log-rank tests are limited in what they can tell you. But as biostatisticians, we could not rely on
using parametric distributions in our models either; we wanted to use a hybrid, semi-parametric kind
of survival regression. We wanted one that was partly nonparametric, meaning it didn’t assume any
mathematical formula for the shape of the overall survival curve, and partly parametric, meaning we
could still use a formula with parameters (regression coefficients) to describe how the predictors
shift that curve. In 1972, a statistician named David Cox developed a workable method for doing this. The
procedure is now called Cox proportional hazards regression, which we call PH regression for the
rest of this chapter for brevity. In the following sections, we outline the steps of performing a PH
regression.
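The semi-parametric idea is usually written as h(t | x) = h0(t) × exp(b × x), where h0(t) is a baseline hazard left completely unspecified and b is a regression coefficient. The following is a minimal sketch, not a substitute for your statistical software, of the partial log-likelihood that PH regression maximizes for a single predictor. The function name, the four-subject data set, and the crude grid search are all assumptions made for illustration, and tied event times are treated with the Breslow convention. Notice that h0(t) never appears in the code: only the ordering of the event times and the predictor values matter.

```python
import math

def cox_partial_log_likelihood(beta, times, events, x):
    """Cox partial log-likelihood for one predictor (Breslow convention for ties).

    beta   : candidate regression coefficient
    times  : follow-up time for each subject
    events : 1 if the event was observed, 0 if censored
    x      : predictor value for each subject
    """
    ll = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i != 1:
            continue  # censored subjects contribute only through the risk sets
        # Risk set: subjects still under observation at time t_i.
        denom = sum(math.exp(beta * x[j])
                    for j, t_j in enumerate(times) if t_j >= t_i)
        ll += beta * x[i] - math.log(denom)
    return ll

# Hypothetical data: x = 1 for a treated subject, 0 for a control.
times = [2, 3, 5, 8]
events = [1, 1, 0, 1]
x = [1, 0, 1, 0]

# Crude grid search for the coefficient that maximizes the partial likelihood.
grid = [k / 100 for k in range(-300, 301)]
best_beta = max(grid, key=lambda b: cox_partial_log_likelihood(b, times, events, x))
print("estimated coefficient:", best_beta)
print("estimated hazard ratio:", round(math.exp(best_beta), 3))
```

Real software uses a faster numerical optimizer and also reports standard errors and p values, but the quantity being maximized is the same kind of expression. Exponentiating the coefficient gives the hazard ratio for a one-unit increase in the predictor.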
Since 1972, many issues have been identified when using survival regression for biological
data, especially with respect to its appropriateness for the type of data. One way to examine this
is by running a logistic regression model (see Chapter 18) with the same predictors and outcome
as your survival regression model without including the time variable, and seeing if the
interpretation changes.
The steps to perform a PH regression
You can understand PH regression in terms of several conceptual steps, although when you use statistical
software like that described in Chapter 4, it may appear that these steps take place simultaneously. That
is because the output is designed for you, the biostatistician, to walk through the following steps
in your mind and make decisions. You must use the output to:
1. Determine the shape of the overall survival curve produced from the Kaplan-Meier method.